Abstract

Exhaustive search on large datasets is infeasible for several reasons.
Recently developed techniques have made pattern set mining feasible through a
general solver that supports heuristic search, but they have long execution
times and are limited to small datasets. In this paper, we investigate an
approach that uses a genetic algorithm to mine diverse frequent patterns. We
propose a fast heuristic search algorithm that outperforms state-of-the-art
methods on a standard set of benchmarks and is capable of producing
satisfactory results within a short period of time. Our proposed algorithm
uses a relative encoding scheme for the patterns and an effective twin removal
technique to ensure diversity throughout the search.

Keywords: pattern set mining; concept learning; genetic algorithm;
optimization.


1 Introduction 

Recently, pattern set mining has been used instead of pattern mining [1]. In
pattern set mining, the aim is to find a small set of patterns in data that
successfully partitions the dataset and discriminates the classes from one
another [6]. Many algorithms have been proposed in the last few years to find
such sets of patterns [1]. When the search space is too large, or when a small
set of patterns must be selected from a large dataset, exhaustive search
techniques do not perform well.

Large data is challenging for most existing discovery algorithms because
many variants of essentially the same pattern exist, due to (numeric)
attributes of high cardinality, correlated attributes, and so on. This causes
top-k mining algorithms to return highly redundant result sets while ignoring
many potentially interesting results. These problems are particularly apparent
with pattern set discovery and its generalisation, exceptional model mining.
To address this, we deal with the discriminative or diverse pattern set mining
problem: given a set of transactions and a set of patterns in the concept
learning setup, the task is to select a small set of diverse patterns. In the
last few years, many algorithms have been proposed to solve this problem, most
of which are exhaustive or greedy in nature [6]. Constraint programming
methods on a declarative framework [4,6] have had significant success.
However, these algorithms perform very poorly on large datasets and require a
large amount of time, whereas local search methods have been very effective at
finding satisfactory results efficiently.

We investigate whether a genetic algorithm can find a small set of diverse
patterns within a short period of time on large datasets, with minor
modifications to the search technique. Our genetic algorithm has several novel
components: a relative encoding technique learned from the structures in the
dataset, a twin removal technique to remove identical and redundant
individuals in the population, and a random restart technique to avoid
stagnation. We compared its performance with several other algorithms: random
walk, hill climbing and large neighborhood search. The key contributions of
the paper are as follows:

• Demonstrate the overall strength of our genetic algorithm for finding a small
set of diverse patterns.

• Perform a comparative analysis of various types of local search algorithms
and of their relative strengths.

The paper is organized as follows. In the preliminaries section, we give the
definitions necessary to understand the paper. In the related work section, we
review previous work. In the section on our approach, we explain our
algorithms, and in the experimental section, we present our results. We then
conclude with a discussion and a possible outline for future work.


2 Preliminaries 

2.1 Pattern Constraints 

In this section, we explain the concepts needed to understand the diverse
pattern set mining problem. The notation is adopted from Guns et al. [6].

We assume that we are given a set of items $\mathcal{I}$ and a database
$\mathcal{D}$ of transactions $\mathcal{T}$, in which all elements are either 0
or 1. The process of finding the set of patterns which satisfy all of the
constraints is called pattern set mining. A pattern is represented as a pair
of variables $(I, T)$, where $I$ represents an itemset $I \subseteq
\mathcal{I}$ and $T$ represents a transaction set $T \subseteq \mathcal{T}$.
Both are represented by means of boolean variables $I_i$ and $T_t$ for every
item $i \in \mathcal{I}$ and every transaction $t \in \mathcal{T}$.

The itemsets and the transaction sets are generally represented by binary
vectors. The coverage $\varphi_{\mathcal{D}}(I)$ of an itemset $I$ consists of
all transactions in which the itemset occurs:

$\varphi_{\mathcal{D}}(I) = \{ t \in \mathcal{T} \mid \forall i \in I : \mathcal{D}_{ti} = 1 \}$

For example, consider the small dataset presented in Table 1. The itemset
$I = \{B, C\}$ is represented as (0,1,1,0,0), and its coverage is
$\varphi_{\mathcal{D}}(I) = \{t_2, t_5\}$, which is represented by
(0,1,0,0,1,0). The support of an itemset is the size of its coverage set,
$Support_{\mathcal{D}}(I) = |\varphi_{\mathcal{D}}(I)|$, so here
$Support_{\mathcal{D}}(I) = 2$.

Table 1: A small example dataset containing five items and six transactions.

Transaction Id   ItemSet     A   B   C   D   E   Class
t1               {A,B,D}     1   1   0   1   0     +
t2               {B,C}       0   1   1   0   0     +
t3               {A,D}       1   0   0   1   0     +
t4               {A,C,D}     1   0   1   1   0     -
t5               {B,C,D}     0   1   1   1   0     -
t6               {C,D,E}     0   0   1   1   1     -

The dispersion score rates a pattern set by how differently its patterns cover
the transactions. For example, for pattern set size k = 3, consider the three
itemsets $I_1 = \{B, C\}$, $I_2 = \{C, D\}$ and $I_3 = \{E\}$. Their coverages
are $\varphi_{\mathcal{D}}(I_1) = (0,1,0,0,1,0)$,
$\varphi_{\mathcal{D}}(I_2) = (0,0,0,1,1,1)$ and
$\varphi_{\mathcal{D}}(I_3) = (0,0,0,0,0,1)$ respectively. Applying the XOR
operation to each pair of coverages and summing the entries of the result
gives:


$\varphi_{\mathcal{D}}(I_1) \oplus \varphi_{\mathcal{D}}(I_2) = (0,1,0,1,0,1)$, with sum 3,
$\varphi_{\mathcal{D}}(I_1) \oplus \varphi_{\mathcal{D}}(I_3) = (0,1,0,0,1,1)$, with sum 3,
$\varphi_{\mathcal{D}}(I_2) \oplus \varphi_{\mathcal{D}}(I_3) = (0,0,0,1,1,0)$, with sum 2.


The resulting dispersion score is therefore 3 + 3 + 2 = 8.
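
The worked example above can be checked with a few lines of Python. This is only an illustrative sketch: the names `ITEMS`, `DATABASE`, `coverage`, `support` and `xor_count` are chosen here and do not come from the paper, and the database is simply Table 1 typed in as rows of bits.

```python
# Toy database from Table 1: one row per transaction t1..t6, columns A..E.
ITEMS = ["A", "B", "C", "D", "E"]
DATABASE = [
    (1, 1, 0, 1, 0),  # t1
    (0, 1, 1, 0, 0),  # t2
    (1, 0, 0, 1, 0),  # t3
    (1, 0, 1, 1, 0),  # t4
    (0, 1, 1, 1, 0),  # t5
    (0, 0, 1, 1, 1),  # t6
]


def coverage(itemset):
    """Binary coverage vector: 1 where the transaction contains every item."""
    cols = [ITEMS.index(item) for item in itemset]
    return tuple(int(all(row[c] == 1 for c in cols)) for row in DATABASE)


def support(itemset):
    """Support is the number of covered transactions."""
    return sum(coverage(itemset))


def xor_count(cov_a, cov_b):
    """Number of transactions covered by exactly one of the two patterns."""
    return sum(a ^ b for a, b in zip(cov_a, cov_b))


cov1 = coverage(["B", "C"])
cov2 = coverage(["C", "D"])
cov3 = coverage(["E"])
total = xor_count(cov1, cov2) + xor_count(cov1, cov3) + xor_count(cov2, cov3)
print(cov1, support(["B", "C"]), total)
```

Running it reproduces the coverage (0,1,0,0,1,0), the support 2 and the total score 8 from the example.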

2.2 Pattern Set Constraints 

In pattern set mining, we are interested in finding k-pattern sets [5]. A
k-pattern set $\Pi$ is a set of $k$ tuples, each of type $(I^p, T^p)$. The
pattern set is formally defined as follows:


$\Pi = \{\pi_1, \ldots, \pi_k\}$, where $\forall p = 1, \ldots, k : \pi_p = (I^p, T^p)$

Diverse pattern sets: In pattern set mining, highly similar transaction sets
can be found, which is undesirable. To avoid this, several measures can be
used to quantify the similarity between two patterns, such as the dispersion
score [11]:


$dispersion(T^i, T^j) = \sum_{t \in \mathcal{T}} (2T^i_t - 1)(2T^j_t - 1).$

The term $(2T^i_t - 1)$ transforms a binary {0,1} variable into one of range
{-1,1}. This way of computing the dispersion score has a disadvantage: its
magnitude is maximal both when two patterns cover exactly the same
transactions and when one pattern covers exactly the opposite transactions of
the other. For example, if two patterns cover (0,1,1,0,0,1) and (1,0,0,1,1,0),
or (0,1,1,0,0,1) and (0,1,1,0,0,1), the magnitude of the score is 6 in both
cases [6]. This is undesirable because in the second case the two patterns are
identical, so a diversity score should be 0. To address this issue, we propose
a new XOR-based dispersion score to calculate the diversity between two
patterns:

$xorDispersion(T^i, T^j) = \sum_{t \in \mathcal{T}} T^i_t \oplus T^j_t.$

To measure the diversity of a pattern set, we use the following expression,
which is the objective function that we wish to maximize:

$objDispersion = \sum_{i=1}^{k} \sum_{j=1}^{i-1} xorDispersion(T^i, T^j).$
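
The difference between the two measures, and the double sum in the objective, can be made concrete with a short Python sketch. The function names mirror the formulas but are otherwise chosen here for illustration; coverage vectors are assumed to be tuples of 0/1 values. Note that the product-based formula, evaluated literally, yields -6 for the complementary pair and 6 for the identical pair, i.e. the same magnitude, whereas the XOR-based score separates the two cases.

```python
def dispersion(t_i, t_j):
    """Product-based dispersion: maps {0,1} to {-1,1} and sums the products."""
    return sum((2 * a - 1) * (2 * b - 1) for a, b in zip(t_i, t_j))


def xor_dispersion(t_i, t_j):
    """XOR-based dispersion: counts positions where exactly one vector is 1."""
    return sum(a ^ b for a, b in zip(t_i, t_j))


def obj_dispersion(coverages):
    """Objective: XOR dispersion summed over all pairs (i, j) with j < i."""
    k = len(coverages)
    return sum(
        xor_dispersion(coverages[i], coverages[j])
        for i in range(k)
        for j in range(i)
    )


p = (0, 1, 1, 0, 0, 1)
q = (1, 0, 0, 1, 1, 0)  # exact complement of p

print(dispersion(p, q), dispersion(p, p))          # same magnitude
print(xor_dispersion(p, q), xor_dispersion(p, p))  # maximal vs. zero
```

With the three coverage vectors from the example in Section 2.1, `obj_dispersion` returns the same total of 8.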

In the last few years, most algorithms for finding diverse frequent patterns
have struggled to produce good quality solutions on large datasets within a
short period of time. In this paper, to solve this problem, we propose an
XOR-based genetic algorithm with various novel components that works with
large datasets.


3 Related Work 

Many variants of pattern set mining have been investigated in the literature.
Among them, finding patterns that are correlated [10], discriminative [12],
contrast [5] and diverse [11] have become promising tasks. Various algorithms
have been proposed as general frameworks for pattern mining [4], [6] in the
last few years. Many languages and systems have been developed for
declaratively modeling problems, such as Zinc [9], Essence [3], Gecode [13]
and Comet [6], [7].

To search and prune the solution space, most of these methods use systematic
search, which is not only exhaustive in nature but also takes a huge amount of
time. Stochastic search algorithms, on the other hand, do not guarantee
optimality but give approximately best results within a short period of time.
Guns et al. [6] simplified pattern set mining tasks and search strategies by
putting them into a common declarative framework. In a recent work, Hossain et
al. [8] explored the use of genetic algorithms and other stochastic local
search algorithms to solve the concept learning task using small datasets.


4 Our Approach

In this section, we first describe our proposed genetic algorithm for solving
the diverse pattern set problem. We then describe the other algorithms that we
implemented in order to compare them with our algorithm.

4.1 Genetic algorithm 


Algorithm 1 geneticAlgorithm()

p = populationSize
percentChange = 90
P = generate p valid pattern sets
P_b = {}
while not timeout do
    P_m = simpleMutation(P)
    P_c = crossOver(P)
    P* = select best p from (P ∪ P_m ∪ P_c)
    if P* remains same for 100 iterations then
        Π = findBest(P*)
        P_b = P_b ∪ {Π}
        P* = changePopulation(percentChange, P*)
    end if
    P = P*
end while
Π* = findBest(P_b)
return Π*


Genetic algorithms are inspired by the process of natural selection. The
search improves from generation to generation of a population of individuals
by means of mutation and crossover. We use the XOR operation to compute our
objective score, as described in the preliminaries section.

In the initialization step, we randomly generated p valid pattern sets and
kept them in P. To generate a valid pattern set, we exploited the particular
structure of the itemsets: several attributes are mutually exclusive and
cannot be true at the same time. To avoid such invalid situations, we used a
constrained initialization for the representation.

Then we created the populations P_m and P_c: P_m was created using mutation
(shown in Algorithm 2) and P_c using crossover (shown in Algorithm 3). After
that, we selected the best individuals from P, P_m and P_c into P*, whose size
is the same as the population size. We iterated this procedure through several
generations. If P* remained the same for 100 generations, we saved its most
diverse pattern set in P_b and then changed P* using
changePopulation(percentChange, P*) (shown in Algorithm 4), so that the search
does not get stuck in a local optimum. We then copied P* into P, which gave
the population for the next generation. We continued this procedure until
timeout and finally returned the best-scoring pattern set from P_b.
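
The control flow described above (elitist selection over parents, mutants and offspring, a stagnation counter, an archive of bests and a restart) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' Java implementation: the operators are passed in as functions, individuals are assumed to be hashable and orderable (tuples), and `genetic_search`, `patience` and `time_limit_s` are names invented here. As an extra safeguard that is not part of Algorithm 1, the sketch also archives the best individual of the final population, so that the archive is never empty at timeout. The toy usage at the end maximises the number of ones in a bit tuple purely to exercise the loop.

```python
import random
import time


def genetic_search(init_population, mutate, crossover, restart, score,
                   time_limit_s=1.0, patience=100):
    """Elitist GA loop with stagnation-triggered restart (illustrative only)."""
    population = list(init_population)
    size = len(population)
    archive = []   # best individual saved before each restart (P_b)
    stagnant = 0   # consecutive generations without any change
    deadline = time.monotonic() + time_limit_s
    while time.monotonic() < deadline:
        # Using a set for the pool also drops twins across the three groups.
        pool = set(population) | set(mutate(population)) | set(crossover(population))
        ranked = sorted(pool, key=lambda ind: (score(ind), ind), reverse=True)
        selected = ranked[:size]
        stagnant = stagnant + 1 if selected == population else 0
        if stagnant >= patience:
            archive.append(selected[0])
            selected = list(restart(selected))
            stagnant = 0
        population = selected
    archive.append(max(population, key=score))  # safeguard added in this sketch
    return max(archive, key=score)


# Toy usage with hypothetical operators on flat bit tuples.
def _flip(ind):
    i = random.randrange(len(ind))
    return ind[:i] + (1 - ind[i],) + ind[i + 1:]


def _mix(a, b):
    return tuple(random.choice(pair) for pair in zip(a, b))


def _random_bits(n=12):
    return tuple(random.randint(0, 1) for _ in range(n))


random.seed(0)
start = [_random_bits() for _ in range(6)]
best = genetic_search(
    start,
    mutate=lambda pop: [_flip(ind) for ind in pop],
    crossover=lambda pop: [_mix(random.choice(pop), random.choice(pop)) for _ in pop],
    restart=lambda pop: pop[:1] + [_random_bits() for _ in pop[1:]],
    score=sum,
    time_limit_s=0.2,
)
print(sum(best))
```

Because selection always keeps the best individual and the best is archived before every restart, the returned score can never be lower than the best score in the initial population.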

We checked the effect of population size on the result using the tic-tac-toe
dataset and found that population size plays a pivotal role. We discuss this
in detail in the analysis section.


Algorithm 2 simpleMutation(PatternSets P)

index = 0
P_m = {}
size = noOfPatternset(P)
while index < size do
    Π = P[index]
    Π_m = generate a valid neighbor of Π by flipping single bit
    while Π_m ∈ P_m do
        Π_m = generate a valid neighbor of Π by flipping single bit
    end while
    P_m = P_m ∪ {Π_m}
    index++
end while
return P_m


Using simpleMutation(PatternSets P), we created p new pattern sets by
mutation. Each new pattern set was generated by flipping a single, randomly
chosen bit of an existing one. While doing the mutation, we always kept the
structure constraint satisfied.
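
A single-bit mutation with a twin check might look as follows in Python. The representation is an assumption made for this sketch: a pattern set is a tuple of itemsets and each itemset is a tuple of 0/1 values. The structure constraint of the datasets is not specified in enough detail to reproduce, so it is abstracted as a hypothetical `is_valid` predicate that accepts everything by default. The `max_tries` bound is also an addition of this sketch, so that the retry loop cannot run forever.

```python
import random


def flip_one_bit(pattern_set):
    """Return a copy of pattern_set with one randomly chosen bit flipped."""
    row = random.randrange(len(pattern_set))
    col = random.randrange(len(pattern_set[row]))
    itemset = pattern_set[row]
    flipped = itemset[:col] + (1 - itemset[col],) + itemset[col + 1:]
    return pattern_set[:row] + (flipped,) + pattern_set[row + 1:]


def simple_mutation(population, is_valid=lambda ps: True, max_tries=1000):
    """Create one mutant per parent, rejecting invalid mutants and twins."""
    offspring = []
    seen = set()
    for parent in population:
        for _ in range(max_tries):
            child = flip_one_bit(parent)
            if is_valid(child) and child not in seen:
                seen.add(child)
                offspring.append(child)
                break
    return offspring
```

Storing pattern sets as nested tuples makes them hashable, which keeps the twin check a constant-time set lookup.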


Algorithm 3 crossOver(PatternSets P)

index = 0
P_c = {}
size = noOfPatternset(P)
while index < size do
    Π_x = randomly take a pattern set from P
    Π_y = randomly take a pattern set from P
    Π_c = uniformCrossOver(Π_x, Π_y)
    while Π_c ∈ P_c do
        Π_c = uniformCrossOver(Π_x, Π_y)
    end while
    P_c = P_c ∪ {Π_c}
    index++
end while
return P_c


Using crossover (shown in Algorithm 3), we took two pattern sets from the
population to create an offspring. We did this p times, where p is the
population size, which gave p offspring. We used uniform crossover to create
each offspring: every item was randomly chosen from one of the two parent
pattern sets and placed into the new pattern set, and we made sure that no
duplicate remained in the new population. The structure constraint was also
kept satisfied during crossover.
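
The crossover step can be sketched in Python under the same assumed representation (a pattern set as a tuple of itemsets, each a tuple of 0/1 values). Two points are assumptions of this sketch rather than statements from the paper: the uniform crossover is applied bit by bit, and on a twin the parents are redrawn as well, because two identical parents can only ever reproduce themselves. The structure constraint is omitted here.

```python
import random


def uniform_crossover(parent_x, parent_y):
    """Build a child by taking each bit from one of the two parents at random."""
    return tuple(
        tuple(random.choice(bits) for bits in zip(items_x, items_y))
        for items_x, items_y in zip(parent_x, parent_y)
    )


def cross_over(population, max_tries=1000):
    """Create up to len(population) distinct offspring (twins are rejected)."""
    offspring, seen = [], set()
    for _ in range(len(population)):
        for _attempt in range(max_tries):
            parent_x = random.choice(population)
            parent_y = random.choice(population)
            child = uniform_crossover(parent_x, parent_y)
            if child not in seen:
                seen.add(child)
                offspring.append(child)
                break
    return offspring
```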


Algorithm 4 changePopulation(perChange, PatternSets P)

noOfChange = (perChange * sizeOf(P)) / 100
remove lowest noOfChange from P
i = 1
while i ≤ noOfChange do
    Π = randomly create a valid pattern set with k itemsets
    while Π ∈ P do
        Π = randomly create a valid pattern set with k itemsets
    end while
    P = P ∪ {Π}
    i++
end while
return P


To avoid getting stuck in local optima, we used random restart in our genetic
algorithm. When the population did not change for a certain period, we
restarted the algorithm based on two parameters: first, when the restart
happens, and second, how much of the population is changed.
changePopulation(percentChange, P) (shown in Algorithm 4) is used to create
the new population, where P is the population to be changed and percentChange
is the percentage of pattern sets to be replaced. For example, percentChange =
90 means that the lowest-scoring 90% of the pattern sets are deleted and
replaced by newly generated ones, while the top 10% are kept. We experimented
with different values of percentChange and found that percentChange = 90
consistently gave good results.
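
The restart step can be sketched as below. The function and parameter names are chosen for this sketch; in particular `random_pattern_set` stands for whatever generator of valid pattern sets is available and is a hypothetical callback, and `max_tries` is a guard added here. Individuals only need to be hashable, so the example in the usage note uses flat bit tuples for brevity.

```python
import random


def change_population(percent_change, population, score,
                      random_pattern_set, max_tries=1000):
    """Keep the top individuals and replace the rest with fresh, distinct ones."""
    n_change = (percent_change * len(population)) // 100
    ranked = sorted(population, key=score, reverse=True)
    survivors = ranked[:len(ranked) - n_change]
    members = set(survivors)
    for _ in range(n_change):
        for _attempt in range(max_tries):
            fresh = random_pattern_set()
            if fresh not in members:  # twin check
                members.add(fresh)
                survivors.append(fresh)
                break
    return survivors
```

For instance, with a population of ten individuals and `percent_change=90`, only the single best individual survives and nine new ones are generated.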

Our algorithm never allows twins in any population. Before inserting a pattern
set, we checked whether it was a twin of one already present. If it was, we
rejected it and created a new one, repeating this until we found a distinct
valid pattern set.

To compute the objective score of a pattern set, we first found the coverage
of each itemset, which is a boolean array. We then formed all pairwise
combinations of these boolean arrays, applied the XOR operator to each pair,
and added up all the resulting values.

4.2 Large Neighborhood Search (LNS)

We also implemented a large neighborhood search (LNS), following the
implementation in [6]. For LNS (shown in Algorithm 5), we first created a
valid pattern set and computed its score. Then we created its neighbors and
found the best neighbor. If the score of the best neighbor was greater than
that of the current pattern set, we replaced the current pattern set with the
best neighbor.


Algorithm 5 largeNeighbourhoodSearch()

noOfBitToChange = 1
Π = randomly create a valid pattern set with k itemsets
while not timeout do
    P = create 2^noOfBitToChange neighbours for Π
    Π* = find best individual from P
    if getObjectiveScore(Π*) > getObjectiveScore(Π) then
        Π = Π*
    end if
    if Π remains same for 100 iterations then
        noOfBitToChange++
    end if
end while
return Π


In our implementation, the number of neighbors created for a pattern set is
2^n, where n = noOfBitToChange. We started with n = 1, i.e. 2^1 neighbors. If
this did not improve the result for 100 iterations, we incremented n by 1, and
we repeated this whenever LNS was stuck for 100 iterations. To create the
neighbors of a pattern set, we randomly chose an itemset from that pattern set
and then randomly chose an item from that itemset, and did this n times. Since
each item is represented by a boolean value, creating all possible neighbors
for three chosen items gives 2^3 neighbors; for n items, it gives 2^n.
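
The neighborhood construction can be sketched in Python as follows, assuming a pattern set is a tuple of itemsets and each itemset a tuple of 0/1 values. The sketch picks n distinct bit positions with `random.sample` (the text does not say whether repeated positions are possible, so distinctness is an assumption) and enumerates every assignment of those bits with `itertools.product`. One of the 2^n assignments reproduces the current pattern set itself. Validity checking is left out.

```python
import itertools
import random


def lns_neighbours(pattern_set, n_bits):
    """Enumerate all 2**n_bits assignments of n_bits randomly chosen positions."""
    positions = [
        (row, col)
        for row, itemset in enumerate(pattern_set)
        for col in range(len(itemset))
    ]
    chosen = random.sample(positions, n_bits)
    neighbours = []
    for values in itertools.product((0, 1), repeat=n_bits):
        rows = [list(itemset) for itemset in pattern_set]
        for (row, col), value in zip(chosen, values):
            rows[row][col] = value
        neighbours.append(tuple(tuple(itemset) for itemset in rows))
    return neighbours
```

The exponential growth in n is why the search only enlarges the neighborhood after it has been stuck for a while.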


4.3 Hill Climbing with Single Neighbor

For hill climbing (shown in Algorithm 6), we created a valid pattern set and
stored it as the current best pattern set Π*. We then started a loop which ran
for 1 minute. In each iteration, we created a neighbor Π of Π*. If the score
of this new neighbor was greater than that of Π*, we copied the neighbor into
Π* and continued from there by creating a new neighbor of Π*. The cycle went
on until the time was up.


Algorithm 6 hillClimbing()

Π* = randomly create a valid pattern set with k itemsets
bestScore = getObjectiveScore(Π*)
while not timeout do
    Π = generate a valid neighbor from Π*
    currentScore = getObjectiveScore(Π)
    if currentScore > bestScore then
        Π* = Π
        bestScore = currentScore
    end if
end while
return Π*
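
The single-neighbor hill climbing loop is short enough to sketch directly. The version below is generic and makes no claim to match the original Java code: the neighbor generator and the objective are passed in as functions, `flip_random_bit` is a hypothetical neighbor function on flat bit tuples, and the time limit defaults to one second rather than the one minute used in the experiments.

```python
import random
import time


def flip_random_bit(individual):
    """Hypothetical neighbour function: flip one random bit of a flat tuple."""
    i = random.randrange(len(individual))
    return individual[:i] + (1 - individual[i],) + individual[i + 1:]


def hill_climb(start, neighbour, score, time_limit_s=1.0):
    """Accept a single random neighbour only if it strictly improves the score."""
    best, best_score = start, score(start)
    deadline = time.monotonic() + time_limit_s
    while time.monotonic() < deadline:
        candidate = neighbour(best)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score
```

A call such as `hill_climb((0, 1, 0, 0), flip_random_bit, sum, time_limit_s=0.1)` climbs towards the all-ones tuple; with the XOR-based objective the same loop would operate on pattern sets instead.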


4.4 Random Walk

In random walk (shown in Algorithm 7), we created a valid pattern set Π and
copied it into a second pattern set Π*, which holds the best pattern set found
so far. We then started a loop which ran for 1 minute. In each iteration, we
replaced Π with a newly created random valid pattern set and compared its
score with that of Π*. If the score of Π was greater, we copied Π into Π*.
After 1 minute, we took the score of Π*.


Algorithm 7 randomWalk()

bestScore = −∞
Π* = {}
while not timeout do
    Π = randomly create a valid pattern set with k itemsets
    currentScore = getObjectiveScore(Π)
    if currentScore > bestScore then
        Π* = Π
        bestScore = currentScore
    end if
end while
return Π*

5 Experimental Results 

We implemented all algorithms in Java and ran our experiments on an Intel
Core i3 2.27 GHz machine with 4 GB RAM running 64-bit Windows 7 Home Premium.

Table 2: Description of datasets.

Data set         Items   Transactions
Tic-tac-toe         27            958
Primary-tumor       31            336
Soybean             50            630
Hypothyroid         88           3247
Mushroom           119           8124


Table 3: Objective score achieved by different algorithms for various datasets
with different sizes of pattern sets k.

                           Random walk       Hill climbing                 LNS   Genetic algorithm
Data set         k      Avg.      Best      Avg.      Best      Avg.      Best      Avg.      Best
Tic-tac-toe      2       771       798     516.8       753       762       798       798       798
Tic-tac-toe      3    1491.4      1690    1432.2      1593    1825.6      1916      1916      1916
Tic-tac-toe      6      5355      5380    7004.4      7653      7758      7791      7938      7938
Tic-tac-toe      9   17517.6     18224   15977.6     16972   18097.6     17858   18458.4     18624
Tic-tac-toe     10   11393.8     12764     19963     21496   22235.2     22748   22731.4     22816
Mushroom         2      3388      4936         0         0    1362.4      6812      8124      8124
Mushroom         3    6889.6     14576    3249.6     16248    2070.4     10352     16248     16248
Mushroom         6     27260     37440         0         0         0         0     58734     64992
Mushroom         9   33955.2     43216     20960     63392         0         0    103932    142452
Mushroom        10   34117.2     46584   28868.4     73116         0         0  107529.6    130944
Hypothyroid      2     439.6       562     324.4      1622     649.4      3247    2736.4      3247
Hypothyroid      3     937.2      1484         0         0         0         0      5876      6494
Hypothyroid      6      2277      3405         0         0         0         0   12549.4     16325
Hypothyroid      9    3732.8      5864         0         0    5193.6     25968   24234.8     27556
Hypothyroid     10    5916.6      9333   11689.2     29223         0         0   17629.8     21726
Soybean          2       624       624         0         0     374.5       624       630       630
Soybean          3    1242.4      1248     260.4      1136    1168.8      1248      1260      1260
Soybean          6      3155      3438    3304.2      5076    4023.8      4992    5642.8      5664
Soybean          9    5246.8      5778      3770      5634   11113.6     12568   12547.2     12598
Soybean         10      6409      7597    9406.2     12000    7653.8     12090   15531.2     15696
Primary-tumor    2     326.4       329       238       336     334.6       336       336       336
Primary-tumor    3     647.6       658     540.4       672       672       672       672       672
Primary-tumor    6    2115.8      2453      2944      3017    3001.4      3018    3013.6      3024
Primary-tumor    9    3833.2      4372    6616.4      6710      6682      6712    6715.2      6720
Primary-tumor   10      4539      4897    7576.2      8336    8343.4      8393    8351.4      8376


5.1 Dataset 

The datasets used in this paper are taken from the UCI Machine Learning
repository [2] and were originally used in [6]. They are freely available for
download from https://dtai.cs.kuleuven.be/CP4IM/datasets/. The datasets and
their properties are listed in Table 2.

5.2 Results 

In our experiments, we implemented four algorithms and computed the objective
score for each of them. Each algorithm was run on five datasets, whose numbers
of transactions and items can be found in Table 2, with pattern set sizes
k = 2, 3, 6, 9, 10. Each run lasted 1 minute. For each test case, we ran the
code five times and recorded the best and the average score; the results are
reported in Table 3. The genetic algorithm performs better than the other
algorithms in almost all cases. In a few cases, LNS performs as well as the
genetic algorithm. Random walk performs poorly. In a few cases, however, hill
climbing gives the best result (for example, Hypothyroid with k = 10 in
Table 3).

5.3 Analysis

As the number of itemsets grows, the genetic algorithm prevails. The
population size of the genetic algorithm has to be kept within limits: too
small or too large a population gives bad results. With random restart in the
genetic algorithm, changing 90% of the population works better.

Fig. 1 shows the effect of population size for the tic-tac-toe dataset. We
experimented with population sizes from 10 to 2000. For each population size,
we ran the code five times and took the best and the average score. The X-axis
shows the population size and the Y-axis shows the objective score. Fig. 1(a)
shows the average objective score: population sizes between 40 and 500 give
the best results, and the objective score decreases once the population size
exceeds 500. Fig. 1(b) shows the best score: population sizes between 10 and
1000 give the best results, and the objective score decreases once the
population size exceeds 1000. We conclude that the performance of the genetic
algorithm depends on the population size: when the population is too small or
too large, we did not get good answers within the allocated time, since the
calculations become too expensive.

Fig. 2 shows the performance of the search algorithms based on their average
objective score, shown as vertical bars, after 1 minute for all datasets and
pattern set sizes. The genetic algorithm consistently gives good results
compared with the other algorithms, and LNS sometimes performs as well as the
genetic algorithm. For the mushroom and hypothyroid datasets, the objective
score of LNS and hill climbing becomes zero in several cases because the
number of items in these datasets (shown in Table 2) is too large. Fig. 2 also
shows that hill climbing performs better than random walk, whose performance
is very poor.

In Fig. 3, we depict the performance of the different search algorithms for
the tic-tac-toe dataset. In this figure, the objective scores of the search
algorithms are shown as vertical lines at different times. Random walk
performs poorly, as usual. Hill climbing, however, improves very quickly using
a single neighbor. LNS performs very well, with results close to those of the
genetic algorithm. Nevertheless, the genetic algorithm continuously gives the
best result.




(a) Average (b) Best

Figure 1: Search progress for genetic algorithm for the tic-tac-toe dataset with
pattern size k = 6.


[Figure 2 panels, one bar chart per dataset (recovered panel titles:
Tic-Tac-Toe, Mushroom, Hypothyroid, Primary-tumor); x-axis: Pattern Set Size, k
(2, 3, 6, 9, 10); legend: random walk, hill climbing, LNS, genetic algorithm.]

Figure 2: Bar diagram showing comparison of average objective score achieved
by different algorithms for various sizes of pattern sets, k = 2,3, 6,9,10.


6 Conclusion 

In this paper, we proposed a new genetic algorithm, enhanced with random
restart and twin removal along with mutation and crossover, to solve the task
of mining diverse pattern sets. The genetic algorithm shows good results
within a short period of time compared with the other algorithms. In the
future, we would like to improve the performance of the genetic algorithm's
search techniques for large population sizes within the framework of
stochastic local search, and to solve pattern set mining problems on realistic
datasets.